ABSTRACT

We present a literature review on the problem of camera network topology discovery,
focussing on its two main types: overlapping and non-overlapping camera fields of
view. These two problems are fundamentally different and each requires a specifically
tailored approach. We describe the most popular approaches for each problem and
analyse their suitability for our project.
Camera Network Topology Discovery Literature Review

Executive  Summary 

The INVS (Intelligent Networked Video Surveillance) project has been ongoing in the ISRD IAE
Group since 2007. The 2010-2011 year of the project focuses on multi-camera tracking and
matching. The aim is to demonstrate a live camera handover system for a network of about 10
cameras with both overlapping and non-overlapping views. Hence an important decision in this
project is how to handle camera network topology discovery.

Here  we  present  a  literature  review  on  the  problem  of  camera  network  topology  discovery, 
focussing  on  its  two  main  types:  overlapping  and  non-overlapping  camera  fields  of  view.  These 
two  problems  are  fundamentally  different  and  each  requires  a  specifically  tailored  approach.  We 
describe  the  most  popular  approaches  for  each  problem  and  analyse  their  suitability  for  this  project. 
1 Introduction


Cameras are becoming cheaper and smaller day by day. There are cameras in most cell phones,
and surveillance cameras in subway stations, busy streets and shopping malls. There are even
cameras on satellites and trucks.1

A camera network is a set of cameras (ten or as many as hundreds) monitoring some
environment for a particular purpose. Camera networks may be the best way to obtain time-critical
information in situations where the safety of human lives is at stake, such as terrorist attacks in
a busy subway/airport, natural disaster sites or urban combat zones. Camera networks will be
essential for 21st century military, environmental and surveillance applications [Devarajan, Cheng
& Radke 2008].

Camera networks pose several research challenges to the direct application of traditional
computer vision algorithms. Firstly, camera networks usually contain tens to hundreds of
cameras, which is many more than is considered in many vision applications. These cameras are
likely to be spread over a wide geographical area. Until recently, research on camera networks was
conducted in controlled environments, where the cameras were fixed and their relation to each
other was known. Nowadays it is assumed that cameras may be moved intentionally or
accidentally (by being bumped) and that their configuration is not known.

Manual inspection is an inefficient and unreliable way to monitor large camera networks,
especially when one needs to follow a moving target in a crowded scene. In response to this, several
systems have been developed to automate the inspection task. A key part of any camera network
system is understanding the spatial relationships between cameras in the network. In early
surveillance systems, this information was manually specified or derived during camera calibration.
This process is error prone and time consuming. Furthermore, it is not robust, as cameras may go
down or get moved during the observation period. Hence automatic network topology discovery
methods are required.


2  Topology  Estimation 

There are two main types of topology estimation problems [Radke 2010]: overlapping and
non-overlapping. In the overlapping problem it is assumed that the cameras observe parts of the same
environment from different perspectives. The relationship between the cameras can be modeled as
an undirected graph, called the vision graph (see Section 2.1). The vision graph contains an edge
between two cameras if they observe some (or all) of the same scene.

In the non-overlapping problem it is assumed that no two cameras observe the same part of
the environment. Relationships between the cameras are induced by the likelihood that an object
in one camera appears in another after some amount of time. These relationships can be modeled
with an undirected graph called the communication graph, where edge weights correspond to
transition probabilities and times (see Section 2.2).

The vision and communication graphs (and their computation) are fundamentally different.
Consider the hypothetical network of ten cameras in Figure 1 [Devarajan, Cheng & Radke 2008].
The presence of an edge in the communication graph does not imply the presence of the same edge
in the vision graph, since nearby cameras may be pointed in different directions (e.g., cameras A
and C). Similarly, cameras looking at the same scene may be physically distant (e.g., cameras C
and F).

1Google Maps trucks record 360° video as they drive around cities.

Some researchers construct neither the vision graph nor the communication graph explicitly.
Instead, they model the transition probabilities and expected transition times between different
regions of camera views. Such models allow one to derive both the vision and the communication
graphs (see Section 2.3).


Figure 1: (a) A camera network with ten cameras (A-K) and their corresponding fields of view.
(b) The associated communication graph. (c) The associated vision graph.


2.1  Vision  Graph  Estimation 

Mandel, Shimshoni & Keren [2007] describe a simple algorithm for estimating the vision graph
based on motion detection. The authors apply the sequential probability ratio test (SPRT) to
accept/reject the possibility that two cameras observe the same scene based on the correspondence
(or lack thereof) of the aggregated detections. Although the algorithm is simple, the method is not
fully validated as it is only tested on a toy 3-camera network.
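
To make the accept/reject mechanism concrete, the following is a minimal sketch of Wald's SPRT applied to a stream of co-detection outcomes. It is not the formulation of Mandel, Shimshoni & Keren: the Bernoulli observation model, the assumed co-detection rates p0 and p1, the error rates alpha and beta, and the function name sprt_overlap are all illustrative assumptions.

```python
import math

def sprt_overlap(observations, p0=0.2, p1=0.8, alpha=0.01, beta=0.01):
    """Wald's SPRT on a stream of booleans (hypothetical observation model).

    Each observation is True when the second camera reports motion in a
    time slot in which the first camera does. p0 and p1 are ASSUMED
    co-detection rates under "no overlap" (H0) and "overlap" (H1).
    Returns "overlap", "no overlap" or "undecided".
    """
    upper = math.log((1 - beta) / alpha)  # accept H1 at or above this value
    lower = math.log(beta / (1 - alpha))  # accept H0 at or below this value
    llr = 0.0                             # running log-likelihood ratio
    for agree in observations:
        if agree:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "overlap"
        if llr <= lower:
            return "no overlap"
    return "undecided"
```

The appeal of a sequential test in this setting is that a decision is taken as soon as the evidence is strong enough, so clearly overlapping (or clearly unrelated) camera pairs need only a handful of observations.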

Van Den Hengel et al. [2007] begin by assuming that the vision graph is fully connected.
Edges that contradict observed evidence are removed. In particular, an edge between two regions
is removed if one camera observes movement while the other does not (at the same moment in
time). The authors describe an efficient algorithm that learns the near-optimal topology of a large
100-camera network from just one hour of footage. The algorithm is robust as it is not affected
by varying lighting conditions, camera angles and size of moving objects. Some examples of
matching camera views are shown in Figure 2.
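
The exclusion idea can be sketched in a few lines. This is a simplified illustration only: it works at the level of whole cameras rather than regions within views, it removes an edge on a single contradiction (a practical system would need to tolerate detection noise), and the input format and the name exclusion_vision_graph are our own assumptions.

```python
from itertools import combinations

def exclusion_vision_graph(occupancy):
    """Estimate vision-graph edges by exclusion (simplified sketch).

    occupancy maps a camera name to a list of booleans, one per time
    step, True when motion is detected at that step (assumed format).
    Returns the set of camera pairs never contradicted by the data.
    """
    cameras = sorted(occupancy)
    edges = set(combinations(cameras, 2))  # start fully connected
    steps = min((len(v) for v in occupancy.values()), default=0)
    for t in range(steps):
        for a, b in list(edges):
            # One camera sees movement while the other does not:
            # the pair cannot be observing the same region.
            if occupancy[a][t] != occupancy[b][t]:
                edges.discard((a, b))
    return edges
```

Because edges are only ever removed, the cost of processing each new time step shrinks as the graph becomes sparser.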

Devarajan, Cheng & Radke [2008] estimate the vision graph by matching features across
camera views. First, each camera detects a set of distinctive feature points in its image that are likely to
match other images of the same scene. Both the number of features and the length of each feature
descriptor are summarised in a fixed-length structure called a feature digest. Each camera matches
its own feature digest with those of other cameras. An edge in the vision graph is established if
enough matches are found. The feature digest of an image is based on the popular and successful
scale-invariant feature transform (SIFT) detector/descriptor proposed by Lowe [2004]. In
particular, the feature digest is a compressed subset of SIFT features that are both distinctive and spatially
well distributed across the image. The paper analyzes the tradeoffs between the size of the feature
digest, the number of transmitted features, the level of compression and the overall performance
of edge generation.


Figure 2: Groups of images whose views are overlapping. The edges of the vision graph are
shown in green.


2.2  Communication  Graph  Estimation 

Marinakis & Dudek [2006] use a stochastic version of the Expectation-Maximization algorithm
to learn plausible agent trajectories. The approach uses only detection events from the deployed
sensors (equivalent to motion detection in video). The model assumes that the network is
traversed by a fixed number of agents (up to 10), which travel between sensor nodes as well as
an external sink node. Results obtained from simulations and experimental data suggest that the
technique produces accurate topology graphs under a variety of conditions and compares well to
other approaches.

Marinakis, Giguere & Dudek [2007] present a simple topology estimation algorithm that relies
purely on the order of detection events in each camera (rather than their timestamps). The key idea
is to find the smallest graph that successfully explains the observed data. Assuming that there are
N agents in the environment, the algorithm considers all possible trajectories (paths) that could
be taken by these agents. It then constructs the smallest graph that can explain every such path.
Interestingly, when the problem is formulated in this way it is equivalent to set-covering, which
is known to be NP-complete. The authors show that a simple greedy heuristic works well, even
better than a more sophisticated statistical approach. The algorithm is accurate on small simulated
networks. However, no real experiments are performed and problems arise when N is smaller than
the true number of agents.
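
For reference, the generic greedy heuristic for set cover is sketched below. How candidate edges are mapped to the observations they explain is specific to the paper and is not reproduced here; the dictionary-based input and the name greedy_cover are illustrative assumptions.

```python
def greedy_cover(universe, candidates):
    """Generic greedy set cover (illustrative sketch).

    universe: set of items (e.g. observations) that must be explained.
    candidates: dict mapping a candidate edge to the set of items it
    explains (assumed representation).
    Returns the chosen edges in the order they were picked.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Pick the edge explaining the most still-unexplained items;
        # sorting the keys makes tie-breaking deterministic.
        best = max(sorted(candidates),
                   key=lambda e: len(candidates[e] & uncovered),
                   default=None)
        if best is None or not (candidates[best] & uncovered):
            raise ValueError("some items cannot be explained by any candidate")
        chosen.append(best)
        uncovered -= candidates[best]
    return chosen
```

The heuristic gives no guarantee of finding the smallest cover, but it runs in polynomial time, which is what makes it attractive for an NP-complete formulation.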


2.3  Combined  Graph  Estimation 

Makris, Ellis & Black [2004] exploit temporal correlations in observations of agents' movements
through the network. The authors use Expectation-Maximization (EM) to learn a Gaussian
Mixture Model (GMM) that models links between entry/exit zones. For each camera view, a set of
entry/exit zones is automatically learnt [Makris & Ellis 2003].2 A cross-correlation value is
computed for each possible link from an exit zone i to an entry zone j. If the cross-correlation has
a clear peak then there is a real link between i and j. Additionally, estimates of the transition
times and probabilities can be extracted from the cross-correlation. If a link is detected between
the zones of two cameras, then the two cameras are either adjacent (in the communication graph)
or overlapping (edge in the vision graph). In particular, the two cameras are overlapping if the
transition time is approximately zero; otherwise the target moves through an unseen path and so
the cameras are adjacent in the communication graph. The experimental results look promising.
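
The peak test on the cross-correlation can be illustrated as follows. This is a sketch under our own assumptions, not the authors' implementation: event times are integer time steps, the correlation is computed for non-negative lags only, and a peak counts as "clear" when it exceeds the mean by k standard deviations (the threshold rule, k, max_lag and the function names are all hypothetical).

```python
from statistics import mean, pstdev

def transition_histogram(exit_times, entry_times, max_lag):
    """For each integer lag 0..max_lag, count the (exit, entry) pairs
    whose time difference entry - exit equals that lag."""
    entries = {}
    for t in entry_times:
        entries[t] = entries.get(t, 0) + 1
    counts = [0] * (max_lag + 1)
    for t_exit in exit_times:
        for lag in range(max_lag + 1):
            counts[lag] += entries.get(t_exit + lag, 0)
    return counts

def find_link(exit_times, entry_times, max_lag=20, k=2.0):
    """Return the lag of a clear peak, or None when no peak stands out.
    The mean + k * std threshold is an assumed rule for illustration."""
    counts = transition_histogram(exit_times, entry_times, max_lag)
    peak_lag = max(range(len(counts)), key=counts.__getitem__)
    threshold = mean(counts) + k * pstdev(counts)
    return peak_lag if counts[peak_lag] > threshold else None
```

A returned lag close to zero would suggest overlapping views, whereas a larger lag would suggest that the two zones are joined by an unseen path.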

Correlation is effective for monotonic relationships in general, but is not flexible enough to
handle multi-modal distributions. Such relationships can occur, for example, when both cars and
pedestrians are part of the observations. In general, the denser the observations and the longer
the transition time, the more false correspondences will be generated by the method. With this in
mind, Tieu, Dailey & Grimson [2005] improve the approach of Makris, Ellis & Black [2004]. They
use more flexible, multi-modal transition distributions, and explicitly handle correspondence. This
is accomplished by using mutual information as a (more general) measure of statistical dependence
to estimate object correspondence. The approach makes few assumptions and does not require
supervision.
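
To show why mutual information is the more general dependence measure, the sketch below computes the empirical mutual information of paired discrete samples. It illustrates the measure only; the estimator of Tieu, Dailey & Grimson, which operates on transition-time distributions and handles correspondence explicitly, is not reproduced, and the function name is our own.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information, in bits, of paired discrete samples."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("xs and ys must be non-empty and of equal length")
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi
```

For example, with xs = [-1, 0, 1] and ys = [1, 0, 1] (y is the square of x) the linear correlation is zero, yet the mutual information is clearly positive, so the dependence is still detected.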


3  Avoiding  Topology  Estimation 

Camera network topology estimation is a non-trivial task that remains unsolved for a large
number of cameras and complex activity patterns, such as those in a crowded public scene. Recently
Wang, Tieu & Grimson [2010] proposed a method which bypasses the topology inference and
correspondence problem. They use Latent Dirichlet Allocation (LDA)3 to cluster trajectories into
activities and model paths commonly taken by objects across multiple camera views. The method
has few restrictions on the camera topology, the structure of the scene and the number of cameras.
Evaluation is performed on two large real data sets, each of which contains more than 14,000
trajectories. On the downside, the method is limited to learning relationships among activity
patterns; any temporal relationships are not discovered automatically; instead they are determined by
a pre-defined temporal threshold.

Loy, Xiang & Gong [2009a] model the dependencies between activities across camera views
with a time delayed probabilistic graphical model (TD-PGM). The nodes of the graphical model
represent activities in different semantically decomposed regions from different camera views,
while its directed edges encode causal relationships between these activities. The proposed
approach is effective in a 9-camera network installed in a busy train station with complex and diverse
scenes, such as long queues at a ticket office, concourse, train platforms and escalators (see Figure
3). Incredibly, the method works on low-quality 320 × 230 video running at a mere 0.7 frames per
second. Note that the method is rather complex, requiring a two-stage structure learning
algorithm. A simpler algorithm solving the same task is found in their earlier paper [Loy, Xiang &
Gong 2009b].

2Naive K-means is used to cluster starting and ending points of object trajectories into entry/exit regions [Makris &
Ellis 2003].

3LDA is a new standard for document analysis. The model uses multivariate beta distributions to model the
relationships between words, documents and topics.


Figure  3:  The  learned  activity  dependency  graph.  Edges  are  labeled  with  their  associated  time 
delays.  Regions  and  nodes  with  discovered  inter-camera  dependencies  are  highlighted. 


4  Discussion 

The  aim  of  the  INVS  project  for  the  year  2010-2011  is  to  demonstrate  a  live  camera  handover 
system  for  a  network  of  about  10  cameras  with  both  overlapping  and  non-overlapping  views.  To 
this  end,  we  believe  there  are  three  possible  options  for  this  project: 

1.  Model  both  the  vision  and  communication  graphs  independently. 

2. Model a combined graph (see Section 2.3) that allows one to derive the vision and
communication graphs.

3.  Avoid  topology  estimation  entirely  (see  Section  3). 

As we move down the above list, the robustness and the generality of the methods increase.
However, there is a price to pay: the methods become considerably more complex, especially in
option 3.

As far as option 1 is concerned, the construction of the two graphs differs widely. The
construction of the vision graph can be considered simpler as it can be solved by matching features (e.g.,
movement, SIFT) across camera views. The communication graph, on the other hand, requires
one to use sophisticated tools to model transition probabilities and time delays.

Overall  it  seems  that  option  2  is  best  as  it  achieves  a  good  balance  between  method  generality 
and  implementation  complexity. 